You never know whether you've got these things on right or not. You never know whether they're on right or not. Feels wrong. Okay, so I don't need that. Okay. So David came in to me and to say uh i something we'd been talking about for a while was some stuff I'd been working on about using these topic models with uh with networks, with n things with links between them, documents that are linked. Uh n uh it's actually it's related to rapple, but it's not actually that th this is something I haven't actually been working on for a couple of months now, uh just because I came up with something more interesting that I've been working on, but uh th at the time basically I'd been trying to use P_L_S_A_ and those types of models on web pages, and that came up with uh a big problem in that most web pages have very very little text on them. But you can tell what they're about based on the other texts on the site, the texts on the things they're linked to, so what I wanted to know was can you combine the topic models that the pages, that have lots of text on them, with something else that that tells you what a page that has no or very little text on it based on the links, yeah. So quite simply if you have a page one which is like a home page, and that'll maybe have five links and a picture on it, and then because it links to page two and page three. Can you tell what this is about? Yeah, so I started looking at uh Google page rank, which kinda does this by saying well, the links go the other way, it says if this page is about fish or in Google, they say if this page is good in some way, and it links to this other page, and we don't know what this is, then that page is more likely to be good. And it's just page rank is just a simple algorithm for doing this over the whole big network. So I wanted to combine the topic models that tell you beforehand what you think this page and this page are about with this to maybe get a better idea. 
So the idea is, say you've got some text on a page: that's only part of the picture of what the page is really about, and the links pointing to that page are also part of the picture. So I basically came up with an algorithm that just plugs page rank into P_L_S_A_. Right, so with page rank you basically take how good this page is, which is X_, and how good this page is, which is Y_, and you then say how good this page is based on the links coming into it. It's just like a sum over all the links of the value of the page the link comes from. So if this is page rank of X_, page rank of Y_, page rank of Z_, then, and I think this is roughly it off hand, you say page rank of Z_ is equal to one over a constant to normalise it, times the sum over all the links into Z_ of the page rank of the link. Yeah, it's like that, yeah? So, similar idea. If you've got a vector that says what the topics are, say you've learned from P_L_S_A_ a distribution of topics for every page, then you can take the probability of topic one for page Z_ and use the same sort of thing. Right, so nicking a whole bunch of maths from how you work out page rank, it comes up with just an iterative algorithm, which is the current probability of topic one given Z_: that's the old probability plus that sum over the links, and then you normalise between them, so you weight them by alpha and one minus alpha. Yeah. So the one minus alpha part you can call the link contribution, I guess. So what you're saying is that a page is partly from the text that you've got on the page, and it's partly from what the links say it's about. So that alpha thing is how much you trust the text on the page, and how much you trust the links. So that alpha could be different for every page.
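As a rough illustration of the update just described, here is a minimal sketch in Python. It is not the speaker's actual code: the function name `propagate_topics`, the choice to average over in-links as the normalising constant, the fallback for pages with no in-links, and the single global `alpha` are all assumptions made for the example.

```python
def propagate_topics(plsa, in_links, alpha=0.5, iterations=20):
    """Blend each page's text-based topic distribution with its in-links.

    plsa:     dict page -> list of topic probabilities (the starting point)
    in_links: dict page -> list of pages that link to it
    alpha:    trust in the page's own text; 1 - alpha goes to the links
    """
    n_topics = len(next(iter(plsa.values())))
    current = {page: list(dist) for page, dist in plsa.items()}
    for _ in range(iterations):
        updated = {}
        for page, text_dist in plsa.items():
            sources = in_links.get(page, [])
            if sources:
                # Link contribution: average of the linking pages' current topics.
                link_dist = [
                    sum(current[s][k] for s in sources) / len(sources)
                    for k in range(n_topics)
                ]
            else:
                # Assumption: with no in-links, rely on the text alone.
                link_dist = text_dist
            mixed = [alpha * t + (1 - alpha) * l
                     for t, l in zip(text_dist, link_dist)]
            total = sum(mixed)
            updated[page] = [m / total for m in mixed]
        current = updated
    return current


# Hypothetical toy data: topics are [English, French].
plsa = {"fr1": [0.1, 0.9], "fr2": [0.05, 0.95], "error_page": [0.9, 0.1]}
in_links = {"error_page": ["fr1", "fr2"]}
result = propagate_topics(plsa, in_links, alpha=0.3)
```

With a low `alpha`, the page whose only text is English ends up leaning towards the French topic because everything pointing at it is French. A per-page alpha could be supported by passing a dict instead of a single number.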
So what I did then was try this on a big database of web pages. Uh and I had some results from that, but they were all on temp two, so that's not important right now. Uh but as it is not really important, I can regenerate that. The problem was that this gives you a new vector of topics for every page. So you've got two vectors, you've got the one that you've just got from P_L_S_A_ or whatever model, and you've got the one that you get from this. Uh how do you tell which one's better? I had no way to evaluate this well. How do you tell that this page because you don't know where the topics are, 'cause they're learned automatically, uh so It it is, yeah, because uh b yeah, you initialise it with uh actually thi this is kinda wrong, this is not old, but start this is how you this is how you you start it off. So this this is what P_L_S_A_ gives you to start with, and then you iterate this and it becomes something else. Yeah. Mm-hmm. That's it, exactly, yeah. Exactly, yeah, so you've now got you've got P_L_S_A_ gives you two probabilities, right? So it gives you a probability of a topic given a document, which is Z_ given D_ and a probability of a word given a topic. So this is the same for both models. Yeah? Well yeah, what you do is you're taking this and iterating over it, the probability of the topic given the document. Uh so have I made that clear, I'm I'm not sure if I'm explaining this well. Okay. 
Right, so, yeah, we were struggling with how you evaluate this, because if you look at the way that P_L_S_A_ and all those kinds of models have been evaluated, they never do anything quantitatively. They always say, well, look at the clustering: here we've got some documents that all appear to be about the same thing, and here we've got some other documents that all appear to be about the same thing. And you can't really tell exactly how good that is, you can just get an idea. Yeah, so that's the first thing. When I spoke to you about this, you suggested using the likelihood, and Pedro came up with a really good example of why that could be bad. Oh, let me think of what this was. It was to do with cars, basically. You have a topic that's representing cars, and you can tell that it's representing cars by looking at this distribution, and it's the same for both. Now you want to tell which... this page, Pedro's example of what it was, is that it says on the page Lotus Elise. Okay? And, no, I can't, I just remember it was a really... can you remember what it was? No, okay. Yeah, I didn't really consider doing it on perplexity, actually, to be honest, but okay. Okay. Is that fair though? Because I think one of the things with these topic models is that you can't make them cluster... Okay. Hmm. Yeah. Okay. And is that really fair though? What you've got there is a clustering of documents as your evaluation, yeah? These labels, the Yahoo categories, that's a clustering. So by saying that this clustering is closer to that clustering, does that necessarily mean that it's better? Mm-hmm. Okay. Mm-hmm. Mm-hmm. Mm-kay. Mm.
We uh I think your point was that you you train this without labelling first, yeah? Mm. Mm-hmm. Well th yeah, the the thing that worries me about this approach is that say the really really really simple example of this where your directory had two categories in it, uh good and bad, and and and you run P_L_S_A_ on it, now that doesn't do it good and bad, it does, I dunno, English and French pages, so they don't match, but the clustering is still very good in in one respect. Uh but but you you can extend that up, even if you have a hundred categories, P_L_S_A_ can find a hundred clusters that are a good clustering, but are completely unrelated to the directory. Hmm. But but what I wanna know i is that which of them has the better representation of a given page. So so given a random page, uh and these two models, uh and the wo the words that these two models give, which is more likely to Mm-hmm. Mm. Mm-hmm. Mm-hmm. Okay. Yeah. Okay. S okay. I'm I'm just n concerned that that's uh that that's it's answering a question that evaluates in some way, but maybe not in the way that Well that maybe not is the same way I was thinking, but maybe it's a better way um so what I mean by that is that sorry, I'm trying to process all this. Uh Yeah, i it's the same thing, like it it does tell you whether this uh which clustering is better for this task, but it doesn't tell you which clustering better represents the pages necessarily. But it it's it you you could say that it possibly doe it probably does. Okay. Mm. Yeah i Mm. Okay. Mm. I think mm. Uh no, I th I'm I'm coming round to it actually. No, I'm so I I I think I g uh I I get it. Uh I I th I think it probably is fair, I'm just I I've gotten um uncomfortable with it, but I think that'll go if I think about it enough. Uh but but yeah, the it it but it just seems a little suspect, but I think that's just from the whole point that you can't tell which clustering is b better. 
But uh what I'd thought before was could you get people to evaluate this by by hand basically, could you give people a set of pages and like say you did it with five topics for example. Uh now you give people five sheets uh you give uh a t a a user five sheets of paper, each with words in varying sizes representing the probability of that topic for example. Try give the user these pieces of paper to get an idea of what these topics are, that's this P_ of W_ given Z_. And then you give them some pages, and say which one would you which piece of paper would you put that with, and then you get like an intuitive ide uh like how would a person cluster pages given the what these topics supposedly are. 'Cause i you've got something that's fixed, the the P_ of W_ given Z_ with respect to both models. So I was tr I was hoping you could use that to Mm. Mm-hmm. Mm-hmm. Yeah, okay. Hmm. Okay. See, the the thing about web pages I thought as well in that getting humans to evaluate it in some way is that you have most of the web pages that are uh well a lot of the web pages that are important in some way are are are home pages, so Google dot com doesn't actually say the words search on it, I think. I maybe it says go, but you only know it's about search through the thing, but someone looking at it can tell that it's a search engine, because of the way it's laid out, because of the pictures on it and because of all of that. There's a lot of information that's completely lost in this bag of words representation, and I was hoping that in some way you could use that to evaluate this. I have no idea how, but uh yeah. Mm. Yeah. Mm. Yeah. Okay, I think I know it. But it yeah, thi this is effectively what we have here though, and th that directory is one clustering of the marbles. And you're hoping that it it so what you're saying is that I have two clusterings of marbles for my two models, and I have a test clustering of marbles. 
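One way to put a number on "how close is this clustering to the test clustering" is a pair-counting agreement score such as the Rand index. The transcript does not say which measure was proposed, so the sketch below is only an assumed stand-in; the function name and the toy labels are made up for illustration.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree.

    A pair counts as agreement when both clusterings put the two items
    together, or both keep them apart. Cluster names do not matter.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical example: four pages, a directory with two categories.
directory = ["good", "good", "bad", "bad"]
model_one = [1, 1, 0, 0]   # same grouping, different cluster names
model_two = [0, 1, 0, 1]   # a different grouping, e.g. English versus French

score_one = rand_index(directory, model_one)
score_two = rand_index(directory, model_two)
```

The second model here illustrates the worry raised earlier: it may be a perfectly good clustering by language, yet it scores low against a directory organised on a different basis.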
It's saying that one of these is more close to the test t t to the the one that we've been given, is that it's saying that what does that actually say? That doesn't uh it says it's g better for this task, but it doesn't really say anything else. Mm-hmm. Okay. Okay. Okay. Uh and so and so if I get that right, what what the algorithms are still coming up with is in this way you're looking at a clustering of marbles, yeah? If each of the algorithms comes up with a defin Ah okay, right, that makes sense. Okay. Okay. Okay. I'm convinced, I'm convinced, alright. So it's good. Right. Uh alright, well thanks. Um I'm trying to think of anything I haven't said about the stuff, but that uh from looking at the results I got it looked quite good, um just like hand reading them in that one of the things I did was I started a i a web crawl at the IDIAP homepage. It did not very many pages, only like two thousand or something like that, but enough to get the pages that were close to IDIAP to have a lot of links between them. So that there was enough to be able to run this link clustering algorithm. So what you find is that the pages around IDIAP, there's a lot of English pa speaking pages and a lot of French speaking pages, so of course this uh P_L_S_A_ completely separated those out into into categories. Uh so when I looked at the the word likelihoods, there were two topics. One was very clearly English words that most likely word was the, and one was very clearly French words most likely words was de I think and uh No, I had many topics. The others looked like garbage, couldn't tell what they might have been. But anyway, uh tha that's often the case, when you look at it you can tell very clearly that one or two topics are a b not the rest, yeah. It is. Ok Hmm. 
Anyway, when I ran it on that, I looked for the sites that the models differed on, where one said it was probably an English page and the other said it was probably a French page, and I found a good few examples. One that I remember off hand was a page with a big picture, and links down the side, but the links were all pictures. So the words in the links were in French, but there were no words picked up when I did the crawl for the model. And there was a small bit down here, and there was meant to be a picture in there, but it didn't load, or it was meant to be something else, and it said in English, four O_ four, this page cannot be found. So all the words on this page were English, but it was very very clearly a French page. So the first model said it was an English page, 'cause all the words were in English. The second model said very definitely a French page, because all of these pointed to French sites and it was only pointed to... it was something dot F_R_ as well. Yeah. So I found a couple of examples that were kind of like that. But the problem that I found when I ran it, the thing that was a bad point, was this. I was trying to see if you could evaluate this with search: given some keywords, which model would more likely find you a good document, that sort of approach. That's not too relevant necessarily, but say you have a topic distribution for one page, and it scores very very highly on the P_L_S_A_ for one of the topics, and that's right. Okay, with this model, because it's averaging out over all the pages nearby, all of the topic distributions become more flat. It becomes more uncertain about everything. So I don't know if that's necessarily a good thing or not, yeah.
In in some cases it's gonna be a bad thing, b because ev every page is linked by the way the algorithm works, every page is linked eventually to everything else, so it every page has information incorporated in it. Or i the distribution learned for every page has information incorporated in it, from every other page's distribution as well, so it averages out, it flattens all the distributions, depending on what this alpha is. Mm-hmm. Well I I think that the thing is Yeah, that that there's a balance in the uh for example take a page like uh Slashdot, and it's like a news site uh for geeks, and you have uh nerds, yeah. News for nerds, stuff that matters. Uh all of the like hundreds of thousands of pages point to that, but they're about completely different things, so you in effect are losing information about what that page is about, because there's so many things pointing to it and they all average out as Yeah, so that Th that's my first idea w was how I set this I remember for the Lotus thing wasn't there? Um so I thought you wanna probably set this alpha per page, there's probably some heuristics, for example this page, there's no text in it, will set it very low in user information for the links. Problem is, this is what Pedro immediately said, what if you have a page that looks like this? It's got a big picture of a car uh Anyway, and it says Lotus Elise, so you know that it's definitely about car, but it's only got two words. So you can't just do it by words, because those two words completely specify into something or other. Uh su you suggested doing likelihood, but at that time and I can't remember what my problem was with that. It might not have been a good thing, but you've got two likelihoods for it as well. No, maybe not. Mm. Yeah, maybe do it by cross-validation, something. Okay, there's Mm. Hmm. Yeah, that that is fair. Yeah, okay. It yeah, d No. Yeah, i so uh I I think that that's too easy. 
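The flattening effect can be made concrete by measuring the entropy of a page's topic distribution before and after mixing in its neighbours. This is a small sketch with made-up numbers, not a result from the crawl; `mix` is a hypothetical helper standing in for one step of the alpha-weighted update.

```python
import math


def entropy(dist):
    """Shannon entropy in nats; higher means a flatter, less certain distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def mix(text_dist, link_dist, alpha):
    """One alpha-weighted blend of a page's own topics with its link average."""
    return [alpha * t + (1 - alpha) * l for t, l in zip(text_dist, link_dist)]


# Hypothetical page that P_L_S_A_ is very sure about, with mixed neighbours.
text_dist = [0.97, 0.02, 0.01]
link_dist = [0.40, 0.30, 0.30]

for alpha in (1.0, 0.9, 0.5, 0.1):
    blended = mix(text_dist, link_dist, alpha)
    print(alpha, [round(p, 3) for p in blended], round(entropy(blended), 3))
```

In this example the entropy rises as alpha falls, which is the loss of confidence described above. Whether that is harmful depends on whether the original confident topic was right.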
You can do it with some heuristics and you can learn it from the task results, or combine the two, so that's kind of straightforward, I think. Yeah, the thing about doing it through cross-validation or something like that with the task is this. The problem with doing a page rank style algorithm is that it converges better and it gives you better results the closer a network is to being a small world network. So a small world network means basically some far links, but generally very clustered locally. Now the thing about doing a web crawl is, the bigger your web crawl, the closer it'll be to the overall structure of the internet, which is a small world network. The smaller your web crawl, the less small world network like it's going to be. So the more pages you get, the better your results. And people running this style of algorithm, the page rank style stuff, generally run it on about two million or twenty million pages, or whatever. The Stanford web base data set that I was using for a bit of this, I can't remember off hand, but they have maybe a bit over a terabyte worth of web pages in this, on a bunch of servers. Yeah, maybe Wikipedia is a good thing to do. What I was concerned about with Wikipedia though was that while it has a high level of inside links, it has a lot of directory style links as well, and that might mess up it actually being a... Ah okay, that's good. That, plus I guess it's actually straightforward to start with asking, is this a small world network, and then run it on that. You were running that on, what was it, like half a million pages, something like that? And that was for half of it? If I remember. Okay. Okay, the whole thing was... Mm. Mm-hmm. Okay. I mean, I think off hand, thinking about this model, it would have to work better, because two things in the same category are going to be linked. So, yeah. Okay.
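Checking whether a crawl looks like a small world network usually comes down to two numbers: a high average clustering coefficient (locally clustered) together with a short average path length (a few far links keep everything close). Below is a minimal pure-Python sketch, assuming the crawl is treated as an undirected graph stored as a dict of neighbour sets; the function names and the toy graph are illustrative only.

```python
from collections import deque


def average_clustering(adj):
    """Mean, over nodes, of the fraction of neighbour pairs that are linked."""
    total = 0.0
    for node, nbrs in adj.items():
        nbrs = list(nbrs)
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than two neighbours contribute 0
        linked = sum(
            1
            for i in range(k)
            for j in range(i + 1, k)
            if nbrs[j] in adj[nbrs[i]]
        )
        total += 2.0 * linked / (k * (k - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean shortest-path length over all reachable ordered pairs (BFS)."""
    total, count = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in adj[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total += sum(dist.values())
        count += len(dist) - 1
    return total / count if count else 0.0


# Toy graph: two tightly linked triangles joined by a single far link c-d.
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
clustering = average_clustering(graph)
path_length = average_path_length(graph)
```

In practice these two values would be compared against a random graph with the same number of nodes and links; this sketch only computes the raw quantities.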
Yeah, I think it's going to be more true in some networks and in some tasks than others, and I think the Wikipedia is probably a good one to test on. Mm-hmm. Okay. Okay. Hmm. Mm-hmm. Okay. Yeah, maybe I'll have a look at the Wikipedia stuff anyway. I'm stuck doing something else for the next month or two anyway, so. This is always the case. There's another thing. This is new and I haven't thought about it too much, but it's on the same lines: have you guys seen this correlated topic models work? Correlated topic models. It's David Blei, the guy that came up with L_D_A_. He's got a new model that's published in NIPS this year, but his co-author put it on their web site a month or two ago. So you know L_D_A_? Do you all know L_D_A_? Okay. So in L_D_A_ you basically draw from a Dirichlet, which says effectively what type of document this is, and then you draw a topic given that Dirichlet distribution, and then you draw words given that. Alright, well, that's for every word, and that's once for the document, and then you've got a parameter coming into this globally. So that's roughly what the model is, plus other little bits. But anyway, what they did was they reworked this, because one of the problems with L_D_A_, and I think P_L_S_A_ has this problem as well, is that the more topics you have, the less expressive it can be, because if it is pulling a document into one topic, it pushes another topic away from that. Does that make sense? So the topics are necessarily trying to be different from each other.
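The L_D_A_ generative story sketched above (one Dirichlet draw per document, then a topic and a word for every word position) can be written out in a few lines. This is a minimal illustration using only the standard library; the helper names, the tiny vocabulary, and the parameter values are made up, and fitting the model from data is a separate problem not shown here.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet by normalising independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an index with the given probabilities."""
    r = rng.random()
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return index
    return len(probs) - 1  # guard against rounding error


def generate_document(alphas, topic_word, vocab, length, rng):
    """Generate one document: topic mix once, then topic and word per position."""
    theta = sample_dirichlet(alphas, rng)            # once per document
    words = []
    for _ in range(length):                          # once per word
        z = sample_categorical(theta, rng)           # topic for this word
        w = sample_categorical(topic_word[z], rng)   # word given that topic
        words.append(vocab[w])
    return theta, words


# Hypothetical two-topic model: one English-like and one French-like topic.
vocab = ["the", "of", "de", "le"]
topic_word = [[0.9, 0.1, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
theta, words = generate_document([0.5, 0.5], topic_word, vocab, 10, random.Random(0))
```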
So the more topics you have, the more they're sort of squeezing into the space of what things can be about, and things are kind of pushed out. So you try it on one corpus, and once you go over about a hundred, maybe two hundred topics, you add more topics and the model doesn't get better in any way. The new topics are effectively going to just be garbage. Alright, so what they did was they stopped using the Dirichlet and they're starting to use a Gaussian there. So, to spare the details, basically they've got a variable here, I think they call it eta, and from this they generate a distribution of topics in some way, but it's generated from a Gaussian space, so it's given a mean and covariance. Yeah, so what it does, and this is the bit that took me forever to understand actually: same model, but given a Gaussian space, it maps that down into the simplex of what a document can be about. So the simplex, basically, say you've got two topics. No, say you've got three topics, that makes it easier to draw. So this is theta one, theta two, theta three. These are the probabilities that a document is about each topic: theta one is the probability that this document is about topic one, theta two is for topic two, and so on. So these guys all have to add up to one, yeah? So that basically means that what a document is about lies on this space between them, yeah. So this is the simplex. So basically a distribution generated by this guy is a point on this simplex. So what they do is they've got a distribution that maps a Gaussian in some space into a distribution on this thing, so it maps it into a Gaussian in that. The cool thing about that is that if you've got two Gaussians, you can tell how close they are. So you can now tell how close two topic distributions are, right?
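The mapping from a Gaussian down onto the simplex can be sketched as: draw eta from a Gaussian with a given mean and covariance, then squash it with a softmax so the components are positive and add up to one. The code below is a hedged illustration rather than the published model's implementation; passing the covariance as a lower-triangular Cholesky factor is a convenience chosen here, and the numbers are invented.

```python
import math
import random


def softmax(eta):
    """Map any real vector to a point on the simplex."""
    biggest = max(eta)
    exps = [math.exp(e - biggest) for e in eta]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def sample_topic_proportions(mean, chol, rng):
    """Draw eta ~ Gaussian(mean, chol * chol^T) and map it to the simplex.

    chol is a lower-triangular factor of the covariance matrix.
    """
    z = [rng.gauss(0.0, 1.0) for _ in mean]
    eta = [
        m + sum(chol[i][j] * z[j] for j in range(i + 1))
        for i, m in enumerate(mean)
    ]
    return softmax(eta)


# Hypothetical three-topic example in which topics one and two are correlated.
mean = [0.0, 0.0, 0.0]
chol = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],   # row two shares noise with row one: positive correlation
    [0.0, 0.0, 1.0],
]
theta = sample_topic_proportions(mean, chol, random.Random(1))
```

Because the off-diagonal entries of the covariance are free parameters, two topics can be made to rise and fall together, which a single Dirichlet cannot express.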
So you can have as many as you like, and you can tell that they're close to each other, so there's loads of ideas that I had about this, if you can tell that two topic distributions are close to each other. So for example, if you split a document into two parts and said that the two parts have to be close in the distribution, you can see how the document changes, stuff based on that. With this idea, what if, rather than working with this P_ of Z_ given D_ thing that you had from P_L_S_A_, you worked from these Gaussians that you get for every document, and you had some function that said that the Gaussian representing this page should be similar to the ones linking to it? Now I haven't thought about any of the maths for this, but it should be straightforward, I guess, to formulate a basic version in terms of the mean should be similar and the covariance maybe should be similar as well. So it might actually be a better model again. I dunno. I'll send you guys the link for that correlated topic model stuff if you want, but I don't know if it's too interesting to you. Okay. Alright, well, thanks for hearing me out, and the idea especially. Maybe I can do more with this. Alright. Yeah. Yeah, Bastien, you can't release this data for four months.